Efficiency of propensity score adjustment and calibration on the estimation from non-probabilistic online surveys
One of the main sources of inaccuracy in modern survey techniques, such as online and smartphone surveys, is the absence of an adequate sampling frame that could provide probabilistic sampling. This kind of data collection introduces substantial bias into the final estimates of the survey, especially if the estimated variables (also known as target variables) have some influence on a respondent's decision to participate in the survey. Various correction techniques, such as calibration and propensity score adjustment (PSA), can be applied to remove the bias. This study analyses the efficiency of correction techniques in multiple situations, applying a combination of propensity score adjustment and calibration on both types of variables (correlated and not correlated with the missing data mechanism) and testing the use of a reference survey to obtain the population totals for calibration variables. The study was performed using a simulation of a fictitious population of potential voters and a real volunteer survey aimed at a population for which a complete census was available. Results showed that PSA combined with calibration removes considerably more bias than calibration with no prior adjustment. Results also showed that using population totals estimated from a reference survey instead of the available population data makes no difference to the accuracy of the estimates, although it can slightly increase the variance of the estimator.
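The PSA step described in this abstract can be sketched as follows: pool the volunteer and reference samples, model sample membership, and reweight the volunteer sample by inverse propensities. This is a minimal illustration under simulated data; the data-generating process and variable names are assumptions, not the paper's setup.

```python
# Hedged sketch of Propensity Score Adjustment (PSA) for a volunteer
# online sample, using a probability reference sample.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Simulated covariate: volunteers over-represent high x (selection bias).
n_ref, n_vol = 1000, 500
x_ref = rng.normal(0.0, 1.0, n_ref)              # probability reference sample
x_vol = rng.normal(1.0, 1.0, n_vol)              # self-selected volunteer sample
y_vol = 2.0 * x_vol + rng.normal(0, 0.5, n_vol)  # target, observed only for volunteers

# Fit the participation model on the pooled samples (1 = volunteer).
X = np.concatenate([x_vol, x_ref]).reshape(-1, 1)
z = np.concatenate([np.ones(n_vol), np.zeros(n_ref)])
pi = LogisticRegression().fit(X, z).predict_proba(x_vol.reshape(-1, 1))[:, 1]

# Inverse-propensity weights for the volunteer sample.
w = (1.0 - pi) / pi
est_naive = y_vol.mean()
est_psa = np.average(y_vol, weights=w)
print(est_naive, est_psa)  # PSA pulls the estimate toward the population mean
```

A calibration step, as combined with PSA in the paper, would further adjust `w` so that weighted totals of auxiliary variables match known (or reference-survey) population totals.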
Propensity score adjustment using machine learning classification algorithms to control selection bias in online surveys
Modern survey methods may be subject to non-observable bias from various sources. Among online surveys, for example, selection bias is prevalent, due to the sampling mechanism commonly used, whereby participants self-select from a subgroup whose characteristics differ from those of the target population. Several techniques have been proposed to tackle this issue. One such is Propensity Score Adjustment (PSA), which is widely used and has been analysed in various studies. The usual method of estimating the propensity score is logistic regression, which requires a reference probability sample in addition to the online nonprobability sample. The predicted propensities can be used for reweighting with various estimators. However, in the online survey context, there are alternatives that might outperform logistic regression for propensity estimation. The aim of the present study is to determine the efficiency of some of these alternatives, involving Machine Learning (ML) classification algorithms. PSA is applied in two simulation scenarios, representing situations commonly found in online surveys, using logistic regression and ML models for propensity estimation. The results obtained show that ML algorithms remove selection bias more effectively than logistic regression when used for PSA, but that their efficacy depends largely on the selection mechanism employed and the dimensionality of the data.
This study was partially supported by the Ministerio de Economía y Competitividad, Spain [grant number MTM2015-63609-R] and, in the case of the first author, by an FPU grant from the Ministerio de Ciencia, Innovación y Universidades, Spain. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
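The advantage of ML classifiers reported above shows up most clearly when the selection mechanism is nonlinear, so that logistic regression is misspecified. The following sketch illustrates this with a simulated selection mechanism (an assumption for illustration, not the paper's scenarios), comparing the propensity errors of logistic regression and gradient boosting.

```python
# Hedged comparison of logistic regression vs. a tree-ensemble
# classifier for propensity estimation under nonlinear selection.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(42)
n = 2000
X = rng.normal(size=(n, 3))
# Nonlinear selection mechanism: logistic regression is misspecified here.
p_select = 1 / (1 + np.exp(-(X[:, 0] ** 2 - 1 + 0.5 * X[:, 1])))
z = (rng.uniform(size=n) < p_select).astype(int)

mae = {}
for name, model in [("logit", LogisticRegression()),
                    ("gbm", GradientBoostingClassifier(random_state=0))]:
    pi = model.fit(X, z).predict_proba(X)[:, 1]
    # Mean absolute error against the true selection propensity.
    mae[name] = float(np.abs(pi - p_select).mean())
print(mae)  # the boosted model tracks the quadratic selection term
```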
Variable selection in Propensity Score Adjustment to mitigate selection bias in online surveys
The development of new survey data collection methods such as online surveys has been particularly advantageous for social studies in terms of reduced costs, immediacy and enhanced questionnaire possibilities. However, many such methods are strongly affected by selection bias, leading to unreliable estimates. Calibration and Propensity Score Adjustment (PSA) have been proposed as methods to remove selection bias in online nonprobability surveys. Calibration requires population totals to be known for the auxiliary variables used in the procedure, while PSA estimates the volunteering propensity of an individual using predictive modelling. The variables included in these models must be carefully selected in order to maximise the accuracy of the final estimates. This study presents an application, using synthetic and real data, of variable selection techniques developed for knowledge discovery in data to choose the best subset of variables for propensity estimation. We also compare the performance of PSA using different classification algorithms, after which calibration is applied. Finally, we present an application of this methodology in a real-world situation, using it to obtain estimates of population parameters. The results obtained show that variable selection using appropriate methods can provide less biased and more efficient estimates than using all available covariates.
Ministerio de Ciencia e Innovación, Spain [Grant No. PID2019-106861RBI00/AEI/10.13039/501100011033]. FPU grant from the Ministerio de Ciencia, Innovación y Universidades. Funding for open access charge: Universidad de Granada / CBUA, Spain. IMAG-María de Maeztu CEX2020-001105-M/AEI/10.13039/50110001103.
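As a minimal stand-in for the knowledge-discovery variable selectors discussed in this abstract, the sketch below applies a simple filter-style selector (ANOVA F-score) before propensity estimation; the data, the selector, and the covariate structure are illustrative assumptions, not the paper's methods.

```python
# Hedged sketch: filter-style variable selection before propensity
# estimation, so that only covariates related to participation enter
# the propensity model.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 1500
X = rng.normal(size=(n, 10))   # 10 candidate covariates
# Only the first two covariates drive participation; the rest are noise.
logit = 1.2 * X[:, 0] - 0.8 * X[:, 1]
z = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

# Keep the k covariates most associated with the participation indicator.
selector = SelectKBest(f_classif, k=2).fit(X, z)
chosen = np.flatnonzero(selector.get_support())
print("selected covariates:", chosen)

# Estimate propensities from the reduced covariate set only.
pi = LogisticRegression().fit(X[:, chosen], z).predict_proba(X[:, chosen])[:, 1]
```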
Inference from Non-Probability Surveys with Statistical Matching and Propensity Score Adjustment Using Modern Prediction Techniques
Online surveys are increasingly common in social and health studies, as they provide fast and inexpensive results in comparison with traditional ones. However, these surveys often work with biased samples, as the data collection is often non-probabilistic because of the lack of internet coverage in certain population groups and the self-selection procedure that many online surveys rely on. Some procedures have been proposed to mitigate this bias, such as propensity score adjustment (PSA) and statistical matching. In PSA, the propensity to participate in the nonprobability survey is estimated using a probability reference survey and then used to obtain weighted estimates. In statistical matching, the nonprobability sample is used to train models to predict the values of the target variable, and the predictions of the models for the probability sample can be used to estimate population values. In this study, both methods are compared using three datasets to simulate pseudopopulations from which nonprobability and probability samples are drawn and used to estimate population parameters. In addition, the study compares the use of linear models and Machine Learning prediction algorithms for propensity estimation in PSA and for predictive modeling in statistical matching. The results show that statistical matching outperforms PSA in terms of bias reduction and Root Mean Square Error (RMSE), and that simpler prediction models, such as linear models and k-Nearest Neighbors, provide better outcomes than bagging algorithms.
We would like to thank the anonymous referees for their remarks and comments, which have improved the presentation and the contents of the paper.
Ministerio de Economía, Industria y Competitividad, Spain [MTM2015-63609-R]. Ministerio de Ciencia, Innovación y Universidades, Spain [FPU17/0217].
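The statistical matching estimator compared in this abstract can be sketched in a few lines: train a prediction model on the biased nonprobability sample, then average its predictions over the probability reference sample using the design weights. The simulated data, weights, and model choice below are assumptions for illustration.

```python
# Hedged sketch of statistical matching with a k-Nearest Neighbors
# regressor (one of the simpler models the study favors).
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(7)

# Nonprobability (volunteer) sample: biased in x, but y|x is observed.
x_np = rng.normal(1.0, 1.0, 600)
y_np = 3.0 + 2.0 * x_np + rng.normal(0, 0.5, 600)

# Probability reference sample: representative x, target y unobserved.
x_p = rng.normal(0.0, 1.0, 800)
d = np.full(800, 1.0)   # equal design weights, for simplicity

model = KNeighborsRegressor(n_neighbors=10).fit(x_np.reshape(-1, 1), y_np)
y_hat = model.predict(x_p.reshape(-1, 1))
est = np.average(y_hat, weights=d)   # design-weighted mean of predictions
print(est)  # close to the population mean (3.0 in this simulation)
```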
Evaluating Machine Learning methods for estimation in online surveys with superpopulation modeling
Online surveys, despite their cost and effort advantages, are particularly prone to selection bias due to the differences between the target population and the potentially covered (online) population. This leads to unreliable estimates from online samples unless further adjustments are applied. Some techniques have arisen in recent years to address this issue, among which superpopulation modeling can be useful in Big Data contexts where censuses are accessible. This technique uses the sample to train a model capturing the behavior of the target variable to be estimated, and applies it to the nonsampled individuals to obtain population-level estimates. The modeling step has usually been done with linear regression or LASSO models, but machine learning (ML) algorithms have been pointed out as promising alternatives. In this study we examine the use of these algorithms in the online survey context, in order to evaluate and compare their performance and adequacy to the problem. A simulation study shows that ML algorithms can remove volunteering bias to a greater extent than traditional methods in several scenarios.
Ministerio de Economía y Competitividad, Spain. Ministerio de Ciencia, Innovación y Universidades, Spain.
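The superpopulation estimator described above — fit a model on the sampled units, predict the target for every nonsampled census unit, and average observed plus predicted values — can be sketched as follows. The census, the selection mechanism, and the model choice are simulated assumptions, not the paper's scenarios.

```python
# Hedged sketch of the superpopulation (model-based) estimator with
# an ML regressor, on a simulated census with volunteering bias.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
N = 5000
x = rng.normal(size=N)                            # covariate known for the whole census
y = np.sin(x) + 0.5 * x + rng.normal(0, 0.2, N)   # target, nonlinear in x

# Volunteering is more likely for high-x units (selection bias).
sampled = rng.uniform(size=N) < 1 / (1 + np.exp(-x))

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(x[sampled].reshape(-1, 1), y[sampled])

# Keep observed values for sampled units; predict for the rest.
y_full = y.copy()
y_full[~sampled] = model.predict(x[~sampled].reshape(-1, 1))
print(y[sampled].mean(), y_full.mean())  # naive sample mean vs model-based estimate
```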
The R package NonProbEst for estimation in non-probability surveys
Different inference procedures have been proposed in the literature to correct the selection bias that may be introduced by non-random sampling mechanisms. The R package NonProbEst enables the estimation of parameters using some of these techniques to correct selection bias in non-probability surveys. The mean and the total of the target variable are estimated using Propensity Score Adjustment, calibration, statistical matching, and model-based, model-assisted and model-calibrated techniques. Confidence intervals can also be obtained for each method. Machine learning algorithms can be used for estimating the propensities or for predicting the unknown values of the target variable for the non-sampled units. Variance estimation for a given estimator is performed by two different Leave-One-Out jackknife procedures. The functionality of the package is illustrated with example data sets.
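The Leave-One-Out jackknife idea mentioned in the abstract can be illustrated generically for a weighted mean: recompute the estimator with each unit deleted in turn and measure the spread of the replicates. This is a generic sketch in Python, not the package's R implementation.

```python
# Illustrative leave-one-out jackknife variance for a weighted mean.
import numpy as np

rng = np.random.default_rng(5)
y = rng.normal(10.0, 2.0, 200)
w = rng.uniform(0.5, 1.5, 200)   # e.g. propensity-based weights

def wmean(y, w):
    return np.sum(w * y) / np.sum(w)

n = len(y)
theta = wmean(y, w)
# Recompute the estimator with each unit deleted in turn.
theta_loo = np.array([wmean(np.delete(y, i), np.delete(w, i)) for i in range(n)])
var_jk = (n - 1) / n * np.sum((theta_loo - theta_loo.mean()) ** 2)
print(theta, var_jk)
```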
Self-Perceived Health, Life Satisfaction and Related Factors among Healthcare Professionals and the General Population: Analysis of an Online Survey, with Propensity Score Adjustment
Healthcare professionals (HCPs) often suffer high levels of depression, stress, anxiety and burnout. Our main study aims were to estimate the prevalences of poor self-perceived health, life dissatisfaction, chronic disease and unhealthy habits among HCPs and to explore the use of machine learning classification algorithms to remove selection bias. A sample of Spanish HCPs was asked to complete a web survey. Risk factors were identified by multivariate ordinal regression models. To counteract the absence of probabilistic sampling and representativeness, the sample was weighted by propensity score adjustment algorithms. The logistic regression algorithm was considered the most appropriate for dealing with misestimations. Male HCPs had significantly worse lifestyle habits than their female counterparts, together with a higher prevalence of chronic disease and of health problems. Members of the general population reported significantly poorer health and less satisfaction with life than the HCPs. Among HCPs, the prior existence of health problems was most strongly associated with worsening self-perceived health and decreased life satisfaction, while obesity had an important negative impact on female practitioners' self-perception of health. Finally, the HCPs who worked as nurses had poorer self-perceptions of health than other HCPs, and the men who worked in primary care had less satisfaction with their lives than those who worked in other levels of healthcare.
Ministerio de Ciencia e Innovación, Spain.
Estimating General Parameters from Non-Probability Surveys Using Propensity Score Adjustment
This study introduces a general framework for inference on a general parameter using nonprobability survey data when a probability sample with auxiliary variables, common to both samples, is available. The proposed framework covers parameters from inequality measures and distribution function estimates, but the scope of the paper is broader. We develop a rigorous framework for general parameter estimation by solving survey-weighted estimating equations which involve propensity score estimation for the units in the non-probability sample. This development includes the expression of the variance estimator, as well as some alternatives which are discussed under the proposed framework. We carried out a simulation study using data from a real-world survey, on which the application of the estimation methods showed the effectiveness of the proposed design-based inference on several general parameters.
Spanish Government [MTM2015-63609-R]. Instituto de Salud Carlos III. Spanish Government [PID2019-106861RB-I00-AEI-10.13039/50110001103].
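A simple instance of the estimating-equation idea above is the distribution function at a point t: with propensity weights w_i, solve sum_i w_i (1{y_i <= t} - F) = 0, giving a weighted proportion. The simulation and propensity model below are illustrative assumptions, not the paper's application.

```python
# Hedged illustration: estimating a general parameter (the distribution
# function at t = 0) from a nonprobability sample via a propensity-
# weighted estimating equation.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(9)

x_vol = rng.normal(1.0, 1.0, 800)           # volunteer sample, biased in x
y_vol = x_vol + rng.normal(0, 0.3, 800)
x_ref = rng.normal(0.0, 1.0, 800)           # probability reference sample

# Propensity of belonging to the volunteer sample, from the pooled data.
X = np.concatenate([x_vol, x_ref]).reshape(-1, 1)
z = np.concatenate([np.ones(800), np.zeros(800)])
pi = LogisticRegression().fit(X, z).predict_proba(x_vol.reshape(-1, 1))[:, 1]
w = (1 - pi) / pi

t = 0.0
# Weighted solution of sum(w_i * (1{y_i <= t} - F)) = 0.
F_hat = np.average(y_vol <= t, weights=w)
print(F_hat)  # the population F(0) is about 0.5 in this simulation
```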
Combining Statistical Matching and Propensity Score Adjustment for Inference from Non-Probability Surveys
The convenience of online surveys has quickly increased their popularity for data collection. However, this method is often non-probabilistic, as such surveys usually rely on self-selection procedures and internet coverage. These problems produce biased samples. In order to mitigate this bias, methods such as Statistical Matching and Propensity Score Adjustment (PSA) have been proposed. Both of them use a probabilistic reference sample with some covariates in common with the convenience sample. Statistical Matching trains a machine learning model with the convenience sample, which is then used to predict the target variable for the reference sample. These predicted values can be used to estimate population values. In PSA, both samples are used to train a model which estimates the propensity to participate in the convenience sample. Weights for the convenience sample are then calculated from those propensities. In this study, we propose methods to combine both techniques. The performance of each proposed method is tested by drawing nonprobability and probability samples from real datasets and using them to estimate population parameters.
Ministerio de Economía y Competitividad of Spain.
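One standard way to combine the two techniques is a doubly-robust-style estimator: the mean of the matching predictions over the reference sample, plus a PSA-weighted correction from the convenience-sample residuals. This sketch is a generic illustration of that idea under simulated data; the paper's proposed combinations may differ.

```python
# Hedged sketch: statistical-matching predictions plus a PSA-weighted
# residual correction (a doubly-robust style combination).
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

rng = np.random.default_rng(11)

# Convenience (nonprobability) sample, biased toward high x.
x_c = rng.normal(1.0, 1.0, 700)
y_c = 1.0 + 2.0 * x_c + rng.normal(0, 0.5, 700)

# Probability reference sample, representative; y unobserved.
x_r = rng.normal(0.0, 1.0, 700)

# Matching model: predict y from x using the convenience sample.
m = LinearRegression().fit(x_c.reshape(-1, 1), y_c)

# PSA: propensity of belonging to the convenience sample.
X = np.concatenate([x_c, x_r]).reshape(-1, 1)
z = np.concatenate([np.ones(700), np.zeros(700)])
pi = LogisticRegression().fit(X, z).predict_proba(x_c.reshape(-1, 1))[:, 1]
w = (1 - pi) / pi

# Combined estimate: matching prediction mean + weighted residual term.
resid = y_c - m.predict(x_c.reshape(-1, 1))
est = m.predict(x_r.reshape(-1, 1)).mean() + np.average(resid, weights=w)
print(est)  # the true population mean is 1.0 in this simulation
```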